Abstract

With their impressive capability to capture visual content,
deep convolutional neural networks (CNNs) have demonstrated
promising performance in various vision-based applications,
such as classification, recognition, and object detection.
However, due to its intrinsic structural design, a CNN has
limited invariance to translation, rotation, and resizing
for images with complex content, and such invariance is
strongly emphasized in content-based image retrieval. In
this paper, to address this problem, we propose a new
kernelized deep convolutional neural network. We first
motivate our approach with an experimental study that
demonstrates the sensitivity of the global CNN feature to
basic geometric transformations. Then, we propose to
represent visual content with approximate invariance to
these geometric transformations from a kernelized
perspective. We extract CNN features on detected
object-like patches and aggregate these patch-level CNN
features into a vectorial representation with the Fisher
vector model. The effectiveness of the proposed algorithm
is demonstrated on image search with three benchmark
datasets.

1. Introduction 

Vectorial image representation is a fundamental problem in
computer vision. In many visual analysis systems, the visual
content of an image is represented as a fixed-size vector
for the convenience of subsequent processing. In recent
years, much effort has been devoted to first designing
handcrafted visual features and then aggregating these
features into a single vector [38, 39, 32, 20, 22].

The bag-of-visual-words (BoVW) model is one of the best
known methods to construct an image representation. In the
BoVW model, a set of local invariant visual features is
first extracted on detected image patches or on densely
sampled grids. Then an image is represented as a visual
word histogram based on the quantization of the local
features with a visual vocabulary trained off-line. The
visual vocabulary is usually trained with an unsupervised
clustering algorithm, such as standard k-means, hierarchical
k-means [29], or approximate k-means. The quantization is
usually performed by nearest neighbor or approximate nearest
neighbor search. Namely, each local invariant visual feature
is quantized to its nearest or approximate nearest visual
word in the vocabulary, which is a kind of hard vector
quantization. Instead of hard vector quantization, Wang et
al. [39] proposed a locality-constrained linear coding
approach to quantize each local visual feature.

Kernel methods are another way to transform a set of
features into a vectorial representation, such as the Fisher
kernel and the democratic kernel [22]. The Fisher kernel
models the joint probability distribution of the visual
features detected in an image. The vectorial representation
is constructed from the derivatives in the parameter space.
Besides the quantization results used in the BoVW model,
the Fisher kernel also includes the residual information
between the local visual features and their visual words.
The Fisher kernel has been demonstrated to be more efficient
than the BoVW model in image classification and image search
applications [32, 22, 20, 18, 33]. A non-probabilistic
version of the Fisher kernel, named the vector of locally
aggregated descriptors (VLAD), is carefully investigated in
[20].

Instead of relying on handcrafted visual features such as
SIFT [27], SURF, and HOG, a deep convolutional neural
network (CNN) learns a non-linear transformation model from
a large-scale, well organized semantic dataset, namely
ImageNet. With the learned non-linear transformation model,
each image can be transformed into a feature vector. By
learning deep networks from a large-scale dataset, the CNN
model can discriminate diverse visual content well, which is
desired in many visual information processing systems. With
breakthroughs in many computer vision tasks, the CNN model
has become a milestone in visual representation and a new
benchmark baseline [36].

Much effort has been made to understand the representation
ability of convolutional neural networks. Goodfellow et al.
test the invariance of deep networks with a natural video
dataset and find that “deep” structures obtain more
invariance than
Figure 1. Illustration of our motivation for proposing the
kernelized convolutional neural network. (We refer to the
Caffe implementation for the CNN details. There should be no
substantial differences from the original CNN model.) (a) A
simple image with a single object localized at the center
(roughly aligned). (b) A complex image with several objects.
(c) The proposed kernelized convolutional neural network
algorithm.


the “shallow” ones. In [40], Zeiler and Fergus try to
understand why deep convolutional neural networks work so
well. They propose to visualize the patterns activated by
the intermediate layers with a deconvolutional network,
which reveals that the top layers can capture complex
patterns. Lenc et al. study the mathematical properties of
equivariance, invariance, and equivalence of image
representations such as SIFT or CNN from a theoretical
perspective. In [9], Cimpoi et al. conduct a range of
experiments on material and texture attribute recognition
and find that CNN also obtains excellent results on this
topic. In [26], Long et al. study the correspondence learned
by CNN at a fine level and reveal that good keypoint
prediction can be obtained with the learned intermediate CNN
features. More specifically, Razavian et al. demonstrate
that local spatial information of the image is also conveyed
by CNN and that this local information can be used to
perform facial landmark prediction, semantic segmentation,
and object keypoint detection.


However, CNN is best suited to describing images with a
single object localized at the center, namely roughly
aligned images as shown in Fig. 1(a). For a complex image
with multiple objects, it is unsuitable to extract a single
global CNN feature, as shown in Fig. 1(b), because there may
exist geometric transformations on these objects. As a more
reasonable alternative, we can first align the content of
the image and then construct the global vectorial
representation. Hence, inspired by invariant representations
obtained by pooling local features, in this paper we propose
to represent an image with local CNN features to address
translation and resizing invariance, and to pool the
transformed CNN features to achieve a fixed-size
rotation-invariant representation, which we call the
kernelized convolutional neural network (KCNN) in the
following. Specifically, we first detect some object-like
patches in the given image. Then, for each detected
object-like patch, we extract a CNN feature to describe the
object in it. Finally, to form a vectorial representation of
the whole image, we aggregate these object-level CNN
features with a kernel function, as shown in Fig. 1(c).


We organize the rest of the paper as follows. In Section 2,
we present some studies on the sensitivity of the global CNN
feature to three specific transformations. In Section 3, we
introduce our algorithm in detail. The experimental results
are presented in Section 4. Finally, we draw conclusions in
Section 5.


2. Sensitivity of Global CNN Feature 

In this section, we study in detail the sensitivity of the
global CNN feature to geometric transformations, i.e.,
translation, scaling, and rotation. The study is made on the
Holidays dataset [20], a benchmark dataset for image search
with 1491 high-resolution images. We use the Caffe-based CNN
implementation [23] to extract our CNN features. In the
following, given an image I, we use f(·) to denote its CNN
feature extracted from the "fc7" layer and m(·) to denote
the cosine similarity between two CNN features. All CNN
features are L2-normalized by default. To reveal the impact
of each geometric transformation on the global CNN feature
independently, we design the following experiments so that
each image undergoes only one kind of geometric
transformation.


Translation. Generally, a translation can be made in the
vertical and horizontal directions. To simplify the study, we
Figure 2. The experiment to study the translation property
of the global CNN feature. (a) Illustration of image
translation. (b) Two examples of the similarity of the global
CNN feature before and after the translation transformation.
(c) The mean and standard deviation of the similarity of the
global CNN features with respect to the translation
transformation.


Figure 3. The experiment to study the scaling property of
the global CNN feature. (a) Illustration of image scaling.
(b) Two examples of the similarity of the global CNN feature
before and after the scaling transformation. (c) The mean
and standard deviation of the similarity of the global CNN
features with respect to the scaling transformation.


consider only the translation in the horizontal direction, as
shown in Fig. 2. The extension to general translation is
straightforward. Given an image I of size M × N, we
generate a larger image of size M × 2N, as shown in
Fig. 2(a), and pad the left half part with image I by the
border extrapolation method. Then we circularly translate I
by t pixels to the left to construct its transformed version
I(t) and extract the global CNN feature f(I(t)). We measure
the consistency score between the global CNN features of
I(t = 0) and I(t) with their cosine similarity:

$m(I(t)) = \langle f(I(t = 0)), f(I(t)) \rangle$,    (1)

in which $\langle \cdot, \cdot \rangle$ denotes the inner
product.
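As a rough illustration of the measurement in Eq. (1), the sketch below circularly shifts a padded image and compares L2-normalized features with an inner product. The `extract_feature` callable is a hypothetical stand-in for the fc7 extractor (a toy function is used here so that the sketch runs without Caffe), and the use of `np.pad(..., mode="edge")` as the border extrapolation is an assumption, not a detail given in the text.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def translation_consistency(image, extract_feature, shifts):
    """Return m(I(t)) = <f(I(0)), f(I(t))> for each horizontal shift t.

    image: H x W (or H x W x C) array.
    extract_feature: hypothetical callable mapping an image to a 1-D feature.
    """
    width = image.shape[1]
    # Assumed padding scheme: double the width by replicating the border.
    pad = [(0, 0), (0, width)] + [(0, 0)] * (image.ndim - 2)
    canvas = np.pad(image, pad, mode="edge")
    reference = l2_normalize(extract_feature(canvas))
    scores = []
    for t in shifts:
        shifted = np.roll(canvas, -t, axis=1)  # circular shift to the left
        scores.append(float(reference @ l2_normalize(extract_feature(shifted))))
    return scores

def toy_feature(img):
    """Toy stand-in for a CNN feature: mean intensity over a 4 x 4 grid of cells."""
    h, w = img.shape[:2]
    cells = [img[i * h // 4:(i + 1) * h // 4, j * w // 4:(j + 1) * w // 4].mean()
             for i in range(4) for j in range(4)]
    return np.array(cells)

rng = np.random.default_rng(0)
image = rng.random((32, 32))
scores = translation_consistency(image, toy_feature, shifts=[0, 8, 16, 32])
```

The same loop applies to the scaling and rotation studies by swapping the shift for a resize or a rotation of the input.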

In Fig. 2(b), we illustrate two examples of the similarity
between the global CNN features before and after the
translation transformation. It can be seen that as the
horizontal translation increases, the similarity first
declines and then grows after reaching a valley. The
decrease in similarity reflects the fact that the global CNN
feature is sensitive to the translation transformation. On
the other hand, the increase in similarity after the valley
point demonstrates the effect of the flipping operation
that is applied during the training stage of the CNN model.
A similar phenomenon is also demonstrated by the statistical
results shown in Fig. 2(c). The difference in the trends of
the similarity curves reflects that the tolerance of the
global CNN feature to the translation transformation is also
related to the content of the image.

Scaling. In Fig. 3, we show our experiment to study the
scaling property of the global CNN feature. The similarity
used to measure the image scaling transformation is defined
as

$m(I(s)) = \langle f(I(s = 1)), f(I(s)) \rangle$,    (2)

where I(s) denotes the new image resized from the original
image I, with width and height s times those of I. To keep
the image I(s) at the same size, we pad the region beyond
the image boundary by the border extrapolation method.
Another choice is to crop sub-images of different sizes at
the same location. However, there should be no substantial
difference between these two methods of constructing I(s).
Then we extract the global CNN feature f(I(s)).


In Fig. 3(b), we illustrate two examples of the similarity
of the global CNN features before and after the scaling
transformation. It can be seen that the similarity score
decreases as the image is scaled with different ratios,
which means that the global CNN feature is not invariant to
the scaling transformation. A similar phenomenon is also
demonstrated by the statistical results shown in Fig. 3(c).


Figure 4. The experiment to study the rotation property of
the global CNN feature. (a) Illustration of image rotation.
(b) Two examples of the similarity of the global CNN feature
before and after the rotation transformation. (c) The mean
and standard deviation of the similarity of the global CNN
features with respect to the rotation transformation.

Rotation. In Fig. 4, we show our experiment to study the
rotation property of the global CNN feature. We measure the
consistency score of the CNN feature under the rotation
transformation as

$m(I(\theta)) = \langle f(I(\theta = 0^\circ)), f(I(\theta)) \rangle$,    (3)


where I(θ) denotes the new image after the image I is
rotated by θ degrees, as shown in Fig. 4(a). Please note that
the image size changes after rotation, as can be seen by
comparing the panels for θ = 0° and θ = 45° in Fig. 4(a).
To study the property of the global CNN feature when only
the rotation transformation exists, we extract the CNN
feature on the sub-image located at the center of I(θ), as
illustrated by the blue square inside the red inscribed
circle in Fig. 4(a).

In Fig. 4(b), we illustrate two examples of the similarity
of the global CNN features before and after the rotation
transformation. It can be seen that the similarity varies as
the image is rotated by different angles and that the
similarity curves of these two examples have different
trends. A similar phenomenon is also demonstrated by the
statistical results shown in Fig. 4(c), which demonstrates
that the global CNN feature is sensitive to the rotation
transformation. The different trends of the similarity
curves mean that the tolerance of the global CNN feature to
the rotation transformation is also related to the content
of the image.



Figure 5. Two example images with detected object-like
patches. Only several top ranked patches are shown.


Discussion. From the experiments above, it can be observed
that the similarity m of the global CNN features before and
after transformation is sensitive to translation, rotation,
and scaling. This comes from the architecture of the CNN
model, in which the neurons are closely tied to the spatial
positions of the image pixels in their local receptive
fields. When the image is transformed, the spatial positions
of those pixels change, which results in inconsistent CNN
features and limits the robustness of the CNN feature to
geometric transformations such as translation, scaling, and
rotation. To address this problem, we propose to first align
the image content at the patch level before extracting the
CNN feature. Such a strategy makes the feature robust to
translation and scaling changes. Moreover, to enhance the
robustness to rotation changes, each image patch is rotated
to 8 orientations. Then, to build a vectorial image-level
representation, we aggregate the extracted patch-level CNN
features with kernel functions.

3. Kernelized Convolutional Neural Network

In this section, we introduce in detail our algorithm to
construct vectorial representations of roughly
content-aligned images with the kernel method and the deep
convolutional neural network.

Consider two sets of image patches X and Y with card(X) = n
and card(Y) = m. We use a match kernel K(·, ·) [17, 7, 22]
to measure the similarity between X and Y:

$K(X, Y) = \sum_{x \in X} \sum_{y \in Y} k(x, y)$,    (4)

where k(·, ·) measures the similarity between two feature
descriptors, x stands for an image patch in X, and y
similarly stands for an image patch in Y.

To construct a vectorial representation for each image, we
consider separable kernel functions. Namely, the similarity
k(x, y) between two feature descriptors can be computed by
an inner product, as shown by the following equation:

$$\begin{aligned}
K(X, Y) &= \sum_{x \in X} \sum_{y \in Y} k(x, y) \\
        &= \sum_{x \in X} \sum_{y \in Y} \langle \phi(x), \phi(y) \rangle \\
        &= \Big\langle \sum_{x \in X} \phi(x), \sum_{y \in Y} \phi(y) \Big\rangle
\end{aligned} \qquad (5)$$

where φ(·) denotes a linear or nonlinear transformation and
Φ(X) = Σ_{x∈X} φ(x) is the final image-level vectorial
representation we need.
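The separability argument behind Eq. (5) can be checked numerically. The minimal sketch below assumes the mapped patch descriptors φ(x) are already available as rows of an array (random vectors here, purely for illustration) and compares the double sum of pairwise inner products with the inner product of the two summed vectors. The function names are hypothetical.

```python
import numpy as np

def match_kernel(phi_x, phi_y):
    """Sum of k(x, y) = <phi(x), phi(y)> over all pairs of patches."""
    return float(sum(px @ py for px in phi_x for py in phi_y))

def aggregate(phi_set):
    """Image-level vector: the sum of the mapped patch descriptors."""
    return np.sum(phi_set, axis=0)

rng = np.random.default_rng(0)
phi_x = rng.standard_normal((5, 16))  # n = 5 mapped patches of one image
phi_y = rng.standard_normal((7, 16))  # m = 7 mapped patches of another image

pairwise = match_kernel(phi_x, phi_y)                     # n * m inner products
aggregated = float(aggregate(phi_x) @ aggregate(phi_y))   # a single inner product
```

The two values agree up to floating-point error, which is why each image can be summarized once by a single vector and compared with one inner product at search time.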

In Eq. (5), the key issue is how to define the function
φ(·). Firstly, as the size of x is not fixed, we need a
function to transform x into a fixed-dimensional vectorial
representation, which we denote by γ(·). Secondly, to
aggregate these patch-level vectorial representations γ(x)
into the final image-level vectorial representation, we need
a function to map γ(x) into another space. This step is
denoted by β(·), so that we have the form

$\phi(x) = \beta(\gamma(x))$.    (6)

In the following, we discuss how to design the functions
γ(·) and β(·).

γ(·): In computer vision, describing image patches of
various sizes with a fixed-length feature vector is a
fundamental problem. There are many classic works on it
[27]. For example, in the SIFT algorithm [27], a spatially
constrained gradient histogram is used to represent the
image patch. More recently, researchers have turned to
large-scale machine learning techniques. Recent research
has revealed that the deep convolutional neural network
(CNN) is very powerful for many computer vision tasks. The
CNN model is learned from a million-scale database,
ImageNet. With its non-linearity and large number of
parameters, CNN can handle a wide variety of vision tasks.
In this paper, we adopt the CNN model of [23] to transform
each image patch into its vectorial representation. In [23],
a pre-trained CNN model and well organized code are made
publicly available for academic use.

β(·): After the image patches are transformed into vectorial
representations, we adopt separable kernel methods to
aggregate them into a representation of the image. Much work
has been devoted to kernel methods. One classic separable
kernel is the Fisher kernel, which models the joint
probability distribution of a set of features [33, 32, 31,
20]. Perronnin et al. applied the Fisher kernel to image
classification and image retrieval. They model the joint
probability distribution of the features with a Gaussian
mixture model (GMM). In the Fisher kernel, the mapping
function β(·) corresponds to the gradient of the
log-likelihood of the features with respect to the
parameters of this distribution, scaled by the inverse
square root of the Fisher information matrix. It gives the
direction in parameter space in which the learned
distribution should be modified to better fit the observed
data. Compared with the BoVW model, the Fisher kernel model
can obtain higher accuracy. Hence, given a set of features,
we adopt the Fisher kernel to construct their vectorial
representation.
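To make the role of β(·) more concrete, the following is a minimal sketch of a Fisher vector restricted to the gradients with respect to the GMM means, using the common closed-form approximation for a diagonal-covariance GMM. The text does not state which gradients or normalizations are used, so the function name, the restriction to the means, and the toy GMM parameters are all assumptions made for illustration.

```python
import numpy as np

def fisher_vector_means(features, weights, means, variances):
    """Normalized gradient of the GMM log-likelihood w.r.t. the means.

    features: (N, D) descriptors; weights: (K,); means, variances: (K, D)
    for a diagonal-covariance GMM. Returns a vector of length K * D.
    """
    n = features.shape[0]
    diff = features[:, None, :] - means[None, :, :]            # (N, K, D)
    # Log of N(x | mu_k, diag(var_k)) for every descriptor and component.
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_prob                      # (N, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)           # numerical stability
    posterior = np.exp(log_joint)
    posterior /= posterior.sum(axis=1, keepdims=True)           # soft assignments
    # Whitened residuals, weighted by soft assignments, summed over descriptors.
    grad = np.einsum("nk,nkd->kd", posterior, diff / np.sqrt(variances))
    grad /= n * np.sqrt(weights)[:, None]
    return grad.ravel()

rng = np.random.default_rng(0)
K, D = 4, 8                                 # toy vocabulary size and feature dimension
weights = np.full(K, 1.0 / K)
means = rng.standard_normal((K, D))
variances = np.ones((K, D))
patches = rng.standard_normal((20, D))      # stand-in for PCA-reduced patch features
fv = fisher_vector_means(patches, weights, means, variances)
```

Because the output is a sum over descriptors, it has the separable form required by Eq. (5): each patch contributes its own term, and the image-level vector has a fixed length regardless of how many patches were detected.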

X: To analyze the visual content of a given image,
researchers usually extract some interesting patches from
it. The word “interesting” refers to clearly defined rules
that give the detected patches the desired properties. For
example, in the SIFT algorithm [27], image patches are
detected with the difference of Gaussians (DoG) method to
obtain scale invariance. Then, to obtain rotation
invariance, each detected patch is aligned with the dominant
orientation of its gradients. In this paper, we use an
object detector to extract some object-like patches from the
image. Once detected, these patches are spatially aligned,
which provides invariance to translation and scaling
transformations. In the recently published BING object
detector [8], Cheng et al. proposed a very efficient
algorithm to detect object-like image patches with a high
detection rate, which can process 300 frames per second on a
single CPU. The BING algorithm outputs a real value for each
patch to indicate how likely the detected image patch is to
be an object. With this real value, we can control the
number of image patches we keep. Considering the excellent
speed of the BING algorithm, we adopt it to extract our
image patches. To achieve rotation invariance of the
extracted image patches, we rotate each image patch x by 8
discrete angles: 0°, 45°, 90°, 135°, 180°, 225°, 270°, and
315°. Intuitively, a dominant angle for each object patch
could be estimated in a similar way to SIFT. However, our
study reveals that such a strategy yields low performance,
due to the unreliability of dominant angle estimation at the
object-patch level. Some examples of object-like patches
detected with the BING algorithm are shown in Fig. 5.

Time cost. Besides the time to extract object-like patches
with the BING detector and the aggregation cost of the
Fisher kernel, the time to extract KCNN is 8 × N times that
of the regular CNN, where N is the number of detected
objects. This can be accelerated with GPU clusters. Since
our paper focuses on addressing the sensitivity of the
regular CNN, our implementation uses the CPU mode of the
Caffe library. To show the effectiveness of KCNN fairly, we
search the database by linear search, with the inner product
used to compute the similarity of two images. Therefore, the
search complexity depends on the dimension of the image
vectorial representation.
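The linear search described above amounts to one matrix-vector product followed by a sort. The sketch below is a minimal illustration with random L2-normalized vectors standing in for image representations; the function name `linear_search` and the `top_k` parameter are hypothetical.

```python
import numpy as np

def linear_search(query, database, top_k=4):
    """Rank database vectors by inner product with the query (exhaustive scan)."""
    scores = database @ query                      # one inner product per database image
    order = np.argsort(-scores, kind="stable")[:top_k]
    return order, scores[order]

rng = np.random.default_rng(0)
database = rng.standard_normal((100, 64))
database /= np.linalg.norm(database, axis=1, keepdims=True)  # L2-normalize rows
query = database[17]                                          # query with a known match
ranked_ids, ranked_scores = linear_search(query, database)
```

The cost of the scan grows linearly with both the number of database images and the dimension of the representation, which is the dependence on dimension noted in the text.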

4. Experimental Results 

In this section, we evaluate our algorithm on the image
retrieval application. We adopt three publicly available
benchmark datasets, i.e., Holidays [20], UKBench [29], and
Oxford Building, to demonstrate the impact of the parameters
in our algorithm. We also compare our algorithm with other
methods for image retrieval.

The Holidays dataset [20] contains 1491 high-resolution
images of different scenes and objects with 500 queries. To
evaluate the performance, we use the average precision,
computed as the area under the precision-recall curve for a
query. We compute the mean of the average precision over all
queries to obtain the mean Average Precision (mAP) score,
which is used to evaluate the overall performance.

The UKBench dataset [29] contains 2550 objects or scenes,
each with four images taken under different views or imaging
conditions, resulting in 10200 images in total. The top-4
accuracy is used as the evaluation metric: for each query,
the retrieval accuracy is measured by counting the number of
correct images among the top-4 returned results, and the
retrieval performance is then averaged over all test
queries.

The Oxford Building dataset consists of 5062 images of
buildings and 55 query images corresponding to 11 distinct
buildings in Oxford. Images are annotated as relevant, not
relevant, or junk, where junk indicates that it is unclear
whether a user would consider the image relevant. Following
the recommended protocol, the junk images are removed from
the ranking results. The retrieval performance is also
measured by the mean Average Precision (mAP) computed over
the 55 queries.
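The two evaluation measures used in this section can be sketched as follows. This is an illustrative implementation, not the official evaluation code of any of the datasets: average precision is computed with the common finite-sum form (mean of the precision at each relevant rank) rather than an explicit area integration, and each ranked list is assumed to contain every relevant image, with any junk images already removed.

```python
def average_precision(ranked_relevance):
    """AP of one query from a ranked list of booleans (True = relevant).

    Assumes the list is a full ranking, so all relevant images appear in it.
    """
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_ranked_relevance):
    """mAP: the mean of the per-query average precision."""
    aps = [average_precision(r) for r in all_ranked_relevance]
    return sum(aps) / len(aps)

def top4_score(all_ranked_relevance):
    """UKBench-style score: average number of relevant images in the top 4."""
    counts = [sum(r[:4]) for r in all_ranked_relevance]
    return sum(counts) / len(counts)

queries = [
    [True, True, False, True, False],
    [True, False, True, False, False],
]
map_score = mean_average_precision(queries)
ukbench_score = top4_score(queries)
```

With this convention the UKBench score ranges from 0 to 4, which matches the scale of the top-4 values reported in the tables.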

Our experiments are run on a server with 32 GB of memory and
a 2.4 GHz Intel Xeon CPU.

4.1. Impact of Parameters 

In this section, we study the impact of parameters. There
are three parameters in our algorithm. The first one is the
number of image patches x detected by the BING detector [8],
denoted by N. The second one is the dimension of the
vectorial representation of an image patch, γ(x). We adopt
the CNN model [23] to construct the vectorial representation
of x, resulting in a 4096-D γ(x). For convenience



Figure 6. Illustration of the impact of the parameters of
the proposed kernelized convolutional neural network (KCNN)
algorithm on the Holidays dataset. Panels (a), (b), and (c)
show the results for D = 32, 64, and 128 when image patch x
is not rotated; panels (d), (e), and (f) show the
corresponding results with x rotated. N is the number of
objects detected with the BING detector [8], V is the number
of Gaussian functions used in the Fisher vector model, and D
is the dimension of the CNN features after PCA dimension
reduction.

Table 1. The performance of the proposed kernelized
convolutional neural network (KCNN) algorithm on three
benchmark datasets, namely Holidays, Oxford Building, and
UKBench. D = 128 and N = 127 are used here.

                           KCNN, non-rotated x               KCNN, rotated x
Dataset           CNN      V=64             V=128            V=64             V=128
Holidays (mAP)    0.68     0.793 (+17.7%)   0.801 (+17.8%)   0.823 (+21%)     0.829 (+21.9%)
UKBench (top-4)   3.41     3.46 (+1.5%)     3.51 (+2.9%)     3.72 (+9.1%)     3.74 (+9.7%)
Oxford (mAP)      0.38     0.48 (+26.3%)    0.51 (+34.2%)    0.42 (+10.5%)    0.45 (+18.4%)


and without loss of generality, we perform principal
component analysis (PCA) to reduce the 4096-D γ(x) to D
dimensions. The last parameter is the visual vocabulary
size used in the Fisher vector [32], which corresponds to
β(·) in Eq. (6) and is denoted by V.

The results are shown in Fig. 6. It can be seen that better
accuracy is obtained when more patches (larger N) are used.
Similarly, with larger D and V, we obtain higher accuracy.
However, the impacts of D and V are smaller than that of N.
In Table 1, we report the performance of the proposed KCNN
algorithm on the Holidays, UKBench, and Oxford Building
datasets. We can see that it is beneficial to rotate image
patch x for the Holidays and UKBench datasets. Especially
for the UKBench dataset, the accuracy of the CNN feature is
3.41 and is improved to 3.51 (+2.9%) with our KCNN algorithm
without rotating x. After rotating x, the accuracy is
improved
Figure 7. Examples of the transformations between images.
(a) UKBench dataset. (b) Oxford Building dataset.


from 3.51 to 3.74 (+6.6%). This is because there are many
rotation transformations in the UKBench dataset, as shown in
Fig. 7(a). However, rotating image patch x is harmful on the
Oxford Building dataset for our KCNN algorithm. A similar
result has also been observed when SIFT features are used to
perform retrieval on this dataset [30]. That is, in the
construction of the SIFT descriptor, better retrieval
performance is obtained with the orientation set to the
gravity orientation instead of the traditional dominant
gradient orientation, since there are very few rotation
transformations among the building images, as demonstrated
in Fig. 7(b).
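The percentages quoted for UKBench are relative improvements, each taken with respect to the score it is compared against. A small sketch of that arithmetic, with a hypothetical helper name, is:

```python
def relative_improvement(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

# UKBench top-4 scores quoted in the text.
cnn_to_kcnn = relative_improvement(3.41, 3.51)          # KCNN without rotation vs. CNN
unrotated_to_rotated = relative_improvement(3.51, 3.74)  # rotated vs. non-rotated KCNN
```

Note that the second figure is measured against the non-rotated KCNN score (3.51), whereas the percentages in Table 1 are all measured against the CNN baseline.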


To further demonstrate the performance of the proposed
kernelized convolutional neural network (KCNN) algorithm, we
show the Average Precision (AP) of each query of the Oxford
Building dataset in Table 2. It can be seen that the
proposed KCNN algorithm achieves better retrieval
performance than the original convolutional neural network
(CNN) algorithm for most queries. The retrieval performance
is improved for 38 of the 55 queries (69.1%). The highest
improvement comes from the query “ashmolean_2”, whose
retrieval performance is improved by 355.3%, from 0.0987 to
0.4495. Some examples on the Holidays dataset are shown in
Fig. 8, in which we give their rank numbers with the CNN
representation and our KCNN representation. From Fig. 8(a)
and Fig. 8(b), it can be seen that our KCNN addresses well
the sensitivity to rotation transformation, which may cause
the global CNN feature to fail. From the first result of
Fig. 8(a) and the second result of Fig. 8(c), it can be seen
that the global CNN can also tolerate slight scaling
transformations, while our KCNN does much better, as shown
in the first result of Fig. 8(c).


4.2. Comparisons 


In this section, we compare our results with those reported
in other research works. As shown in Table 3, the proposed
KCNN method obtains the best results on both the Holidays
and UKBench datasets. However, on the Oxford Building
dataset, SIFT-based methods obtain better results. The
reason is that the Oxford Building dataset consists of
building images, and retrieval on this dataset is more like
a fine-grained problem [28]. On the other hand, the deep
convolutional neural network is designed to tackle the
generic classification problem [11, 24], and fine-tuning is
usually required for fine-grained vision tasks.

Figure 8. Some examples of search results on the Holidays
dataset. Their rank numbers with the CNN feature and the
KCNN feature are given in the blue boxes.

There also exist some works on performing image search with
CNN. However, our work differs substantially from them.
Compared with [36], our goal is totally different. Our goal
is to construct a vectorial representation for an image,
while [36] uses spatial search, which is not a vectorial
image representation. Spatial search means exhaustively
searching all the sub-patches extracted on grids at several
levels. Its search complexity is O(N^2), where N is the
number of extracted sub-patches. Compared with [14], on
Holidays we obtain 0.829 mAP while [14] obtains 0.802 mAP.
Besides the higher accuracy, we also address the rotation
transformation, which [14] does not. Another related work
focuses on constructing compressed codes of the image
representation with a retrained regular CNN, while we focus
on addressing object transformations in the vectorial
representation of complex images without retraining.

5. Conclusion 

In this paper, we have analyzed the sensitivity of the
global CNN feature to geometric transformations of the
image, such as translation, scaling, and rotation. Based on
our analysis, and inspired by the well-studied local feature based
Table 2. The detailed results (Average Precision) of each
query on the Oxford Building dataset.

Building           Method     1        2        3        4        5
all_souls          CNN        0.113    0.21     0.274    0.546    0.156
                   KCNN       0.32     0.416    0.447    0.748    0.491
                   Improved   +184%    +98%     +63%     +37%     +215%
ashmolean          CNN        0.366    0.099    0.068    0.229    0.208
                   KCNN       0.733    0.449    0.293    0.717    0.272
                   Improved   +100%    +355%    +334%    +213%    +31%
balliol            CNN        0.134    0.121    0.206    0.322    0.302
                   KCNN       0.581    0.414    0.388    0.603    0.447
                   Improved   +333%    +242%    +89%     +87%     +48%
bodleian           CNN        0.207    0.262    0.421    0.463    0.431
                   KCNN       0.502    0.592    0.535    0.598    0.63
                   Improved   +142%    +126%    +27%     +29%     +46%
christ_church      CNN        0.442    0.487    0.363    0.213    0.144
                   KCNN       0.472    0.56     0.526    0.399    0.134
                   Improved   +6.8%    +15%     +45%     +87%     -6.8%
cornmarket         CNN        0.594    0.223    0.133    0.139    0.559
                   KCNN       0.426    0.4      0.307    0.515    0.578
                   Improved   -28%     +80%     +130%    +271%    +3.5%
hertford           CNN        0.6      0.617    0.588    0.636    0.64
                   KCNN       0.605    0.604    0.586    0.609    0.471
                   Improved   +0.8%    -2.2%    -0.4%    -4.3%    -27%
keble              CNN        0.383    0.482    0.551    0.209    0.336
                   KCNN       0.696    0.747    0.638    0.547    0.226
                   Improved   +82%     +55%     +16%     +162%    -33%
magdalen           CNN        0.104    0.114    0.087    0.143    0.081
                   KCNN       0.09     0.08     0.078    0.1      0.08
                   Improved   -13%     -30%     -11%     -30%     -2%
pitt_rivers        CNN        0.563    0.678    0.45     0.472    0.519
                   KCNN       0.651    0.846    0.662    0.381    0.621
                   Improved   +15.6%   +24.8%   +47%     -19.5%   +19.7%
radcliffe_camera   CNN        0.876    0.788    0.89     0.889    0.889
                   KCNN       0.743    0.842    0.842    0.876    0.709
                   Improved   -15.2%   +6.9%    -5.3%    -1.5%    -20.2%


Table 3. The performance comparisons with other reported
research works based on global image representation.

Dataset                 [?]      [?]      [22]     CNN      KCNN
Holidays (mAP)          0.735    0.646    0.771    0.68     0.829
UKBench (top-4)         3.50     N/A      3.53     3.41     3.74
Oxford Building (mAP)   N/A      0.555    0.676    0.38     0.51


image representation methods, we proposed our kernelized
convolutional neural network (KCNN) algorithm to describe
the content of complex images. With our KCNN method, we can
obtain a more robust vectorial representation. Besides the
CNN structure implemented in the Caffe library, there are
also other emerging CNN structures [37, 16]. In the future,
we would like to investigate the potential of these CNN
models for image retrieval and the performance of our KCNN
model when integrated with them.